Investigation and Reduction of Discretization Variance in Decision Tree Induction
Authors
Abstract
This paper focuses on the variance introduced by the discretization techniques used to handle continuous attributes in decision tree induction. Different discretization procedures are first studied empirically, then means to reduce the discretization variance are proposed. The experiments show that discretization variance is large and that it is possible to reduce it significantly without notable computational cost. The resulting variance reduction mainly improves the interpretability and stability of decision trees, and marginally their accuracy. Decision trees ([1], [2]) can be viewed as models of conditional class probability distributions. Top-down tree induction recursively splits the input space into non-overlapping subsets, estimating class probabilities by frequency counts over the learning samples belonging to each subset. Tree variance is the variability of the tree's structure and parameters resulting from the randomness of the learning set; it translates into prediction variance, which in turn yields classification errors. In regression models, prediction variance can easily be separated from bias using the well-known bias/variance decomposition of the expected square error. Unfortunately, there is no such decomposition for the expected error rate of classification rules (see e.g. [3, 4]). Hence, we will look at decision trees as multidimensional regression models for the conditional class probability distributions and evaluate their variance by the regression variance resulting from the estimation of these probabilities. Denoting by $\hat{P}_N(C_i|x)$ the conditional class probability estimate given by a tree built from a random learning set of size $N$, evaluated at a point $x$ of the input space, we can write this variance (for one class $C_i$) as

$$\mathrm{Var}\{\hat{P}_N(C_i|x)\} = E_x\left\{E_{LS}\left[\left(\hat{P}_N(C_i|x) - E_{LS}\left[\hat{P}_N(C_i|x)\right]\right)^2\right]\right\}, \qquad (1)$$

where the innermost expectations are taken over the set of all learning sets of size $N$ and the outermost expectation is taken over the whole input space.
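Equation (1) can be estimated by Monte Carlo: draw many learning sets of size $N$, fit an estimator to each, and average the pointwise variance of the probability estimates over the input space. The sketch below is illustrative only, not the paper's estimator: it uses a synthetic one-dimensional problem with known $P(C_1|x) = x$ and a single fixed split at 0.5 as the "tree", so that the measured quantity is purely the variance of the frequency-count estimates in Eq. (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_learning_set(n):
    """Synthetic 1-D problem: inputs uniform on [0, 1], P(C1 | x) = x."""
    x = rng.random(n)
    y = (rng.random(n) < x).astype(int)
    return x, y

def p_hat(x_train, y_train, x, threshold=0.5):
    """Frequency-count estimate of P(C1 | x) in the cell containing x."""
    cell = x_train >= threshold if x >= threshold else x_train < threshold
    if cell.sum() == 0:
        return 0.5  # empty cell: fall back to the prior
    return y_train[cell].mean()

N, n_sets = 200, 300
xs = rng.random(500)                      # sample of points approximating E_x
estimates = np.empty((n_sets, xs.size))
for s in range(n_sets):                   # inner expectation: over learning sets
    xt, yt = sample_learning_set(N)
    estimates[s] = [p_hat(xt, yt, x) for x in xs]

# Eq. (1): variance over learning sets, averaged over the input space.
variance = estimates.var(axis=0).mean()
print(variance)
```

Because the split is held fixed here, this isolates the estimation variance of the leaf frequencies; letting each learning set also choose its own threshold would add the discretization variance discussed next.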
Friedman [4] has studied the impact of this variance on classification error rates, concluding that this term is more important than bias. Sources of Tree Variance. A first (important) source of variance is related to the need to discretize continuous attributes by choosing thresholds. In ...
Similar Articles
Comparing different stopping criteria for fuzzy decision tree induction through IDFID3
Fuzzy Decision Tree (FDT) classifiers combine decision trees with the approximate reasoning offered by fuzzy representation to deal with language and measurement uncertainties. When an FDT induction algorithm utilizes stopping criteria for early stopping of the tree's growth, threshold values of the stopping criteria will control the number of nodes. Finding a proper threshold value for a stopping crite...
TREE AUTOMATA BASED ON COMPLETE RESIDUATED LATTICE-VALUED LOGIC: REDUCTION ALGORITHM AND DECISION PROBLEMS
In this paper, at first we define the concepts of response function and accessible states of a complete residuated lattice-valued (for simplicity we write $\mathcal{L}$-valued) tree automaton with a threshold $c$. Then, related to these concepts, we prove some lemmas and theorems that are applied in considering some decision problems such as finiteness-value and emptiness-value of recognizable t...
DIAGNOSIS OF BREAST LESIONS USING THE LOCAL CHAN-VESE MODEL, HIERARCHICAL FUZZY PARTITIONING AND FUZZY DECISION TREE INDUCTION
Breast cancer is one of the leading causes of death among women. Mammography remains today the best technology to detect breast cancer, early and efficiently, to distinguish between benign and malignant diseases. Several techniques in image processing and analysis have been developed to address this problem. In this paper, we propose a new solution to the problem of computer aided detection and...
Multi-interval Discretization Methods for Decision Tree Learning
Properly addressing the discretization process of continuous valued features is an important problem during decision tree learning. This paper describes four multi-interval discretization methods for induction of decision trees used in dynamic fashion. We compare two known discretization methods to two new methods proposed in this paper based on a histogram based method and a neural net based me...
A Comparision of Different Multi-Interval Discretization Methods for Decision Tree Learning
Properly addressing the discretization process of continuous valued features is an important problem during decision tree learning. This paper describes four multi-interval discretization methods for induction of decision trees used in dynamic fashion. We compare two known discretization methods to two new methods proposed in this paper based on a histogram based method and a neural net based me...
Journal:
Volume, Issue
Pages -
Publication date: 2000